By the end of the lab, you will be able to …
Download and open code-along-02.qmd
Load the standard packages.
Install and load the summarytools package.
You can (and should) make comments in your code
Operators in R are symbols that direct R to perform various kinds of mathematical, logical, and decision operations. A few of the key ones to know before we get started:
To test equality or inequality:
==, !=, >, >=, <, <=
To indicate “and”, “or”, and “not”:
& | !
To assign values to data objects: <-, ->, =
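As a quick illustration of the three assignment forms, here is a minimal sketch. The object names are made up for the example; `<-` is the form most R style guides prefer.

```r
# Three ways to assign a value; object names are hypothetical
a <- 5       # leftward arrow (most common)
10 -> b      # rightward arrow
c_val = 15   # equals sign also assigns at the top level

total <- a + b + c_val
print(total)
```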
| operator | definition |
|---|---|
| < | is less than? |
| <= | is less than or equal to? |
| > | is greater than? |
| >= | is greater than or equal to? |
| == | is exactly equal to? |
| != | is not equal to? |
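The comparison operators in the table can be tried on a small vector. This is only a sketch; `ages` is a hypothetical object, and each comparison returns one TRUE or FALSE per element.

```r
# Hypothetical vector used only to demonstrate comparisons
ages <- c(17, 18, 25)

lt  <- ages < 18    # FALSE for 18 and 25
lte <- ages <= 18   # TRUE for 17 and 18
eq  <- ages == 18   # TRUE only for 18
neq <- ages != 18   # TRUE for 17 and 25

print(lt)
print(lte)
print(eq)
print(neq)
```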
Generally useful in a filter() but will come up in various other places as well…
| operator | definition |
|---|---|
| x & y | is x AND y? |
| x \| y | is x OR y? |
| is.na(x) | is x NA? |
| !is.na(x) | is x not NA? |
| x %in% y | is x in y? |
| !(x %in% y) | is x not in y? |
| !x | is not x? (only makes sense if x is TRUE or FALSE) |
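A short sketch of the logical operators in the table, using made-up vectors `x` and `y`. Note how a missing value (NA) propagates through `&` and `|` but not through `is.na()` or `%in%`.

```r
# Hypothetical vectors for illustration
x <- c(2, NA, 7, 10)
y <- c(7, 10)

both    <- x > 1 & x < 8    # AND: TRUE NA TRUE FALSE
either  <- x < 3 | x > 8    # OR:  TRUE NA FALSE TRUE
missing <- is.na(x)         # FALSE TRUE FALSE FALSE
present <- !is.na(x)        # TRUE FALSE TRUE TRUE
in_y    <- x %in% y         # FALSE FALSE TRUE TRUE
not_in  <- !(x %in% y)      # TRUE TRUE FALSE FALSE

print(both)
print(either)
print(in_y)
```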
Most tasks related to data analysis are not glorious or fancy.
A lot of your time is dedicated to whipping your dataset into the shape that you need to be able to analyze it.
This task goes by different names: “data cleaning,” “data management,” “data manipulation,” “data wrangling,” and “data transformation.”
dplyr package
The dplyr package provides a complete set of functions that help you solve the most common data manipulation challenges such as:
Functions are (most often) verbs, followed by what they will be applied to in parentheses:
|>
The pipe operator passes what comes before it into the function that comes after it as the first argument in that function.
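To make the "first argument" rule concrete, here is a minimal base R sketch (the native pipe requires R 4.1 or later). The object names are hypothetical; the nested and piped versions compute the same thing.

```r
# Hypothetical values for illustration
values <- c(1, 4, 9)

# Nested calls: read from the inside out
nested <- sum(sqrt(values))

# Piped calls: read left to right, top to bottom
piped <- values |>
  sqrt() |>
  sum()

# The left-hand side becomes the FIRST argument:
# this is the same as round(3.14159, 2)
rounded <- 3.14159 |> round(2)

print(nested)
print(piped)
print(rounded)
```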
dplyr style
In data transformation pipelines, always use a space before |> and a line break after |>.
We’ll talk about data visualization pipes later…
Heads Up!
|> (native pipe operator) and %>% (magrittr package) behave identically for simple cases.
dplyr grammar
What’s the advantage of dplyr grammar? We can sequence data manipulation!
# A tibble: 2 × 10
sex variable mean sd min med max n.valid n pct.valid
<dbl+lbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 [male] sex 41.7 13.7 0 40 89 869 1467 59.2
2 2 [female] sex 37.3 13.7 0 40 89 891 1823 48.9
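The idea of sequencing can be sketched on a toy data set. This is not the code that produced the output above; `toy`, `hrs`, and the values in it are invented for illustration. Each verb takes the result of the previous step.

```r
library(dplyr)

# Hypothetical toy data (not the GSS)
toy <- tibble(
  sex = c("male", "male", "female", "female", "female"),
  hrs = c(40, 50, 35, NA, 45)
)

result <- toy |>
  filter(!is.na(hrs)) |>        # 1. keep rows with non-missing hours
  group_by(sex) |>              # 2. split by sex
  summarize(                    # 3. one summary row per group
    mean_hrs = mean(hrs),
    n = n()
  )

print(result)
```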
dplyr basics
dplyr verbs (functions) will allow you to solve the vast majority of your data manipulation challenges. They are organized into four groups based on what they operate on: rows, columns, groups, or tables.
The verbs all have in common:
filter()
tibble
Let’s make a tiny data frame to use as an example:
# A tibble: 5 × 2
x y
<dbl> <chr>
1 1 a
2 2 a
3 3 b
4 4 c
5 5 c
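A tibble like the one printed above could be built as follows, and then passed to filter() to keep only the rows that meet a condition. This is a sketch; the object name `df` is an assumption.

```r
library(dplyr)

# Recreate the 5 x 2 example; `df` is a hypothetical name
df <- tibble(
  x = c(1, 2, 3, 4, 5),
  y = c("a", "a", "b", "c", "c")
)

# Keep rows where x is greater than 2
big_x <- df |>
  filter(x > 2)

# Keep rows where y is "a" or "b"
a_or_b <- df |>
  filter(y %in% c("a", "b"))

print(big_x)
print(a_or_b)
```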
Heads Up!
A tibble is a modern take on the data frame, used throughout tidyverse packages such as dplyr and ggplot2.
Remember, you can access the variables (i.e., columns) using the $ operator, as shown using the table() function.
The variable names are case sensitive. In this dataset, all variables are lowercase.
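A small base R sketch of the `$` operator with table(), using an invented data frame `toy` rather than the course data. It also shows what case sensitivity means in practice: asking for a column with the wrong capitalization returns NULL.

```r
# Hypothetical data frame for illustration
toy <- data.frame(sex = c(1, 2, 2, 1, 2))

# Access the column with $ and tabulate it
sex_counts <- table(toy$sex)
print(sex_counts)

# Names are case sensitive: `Sex` is not a column in `toy`
wrong_case <- toy$Sex
print(is.null(wrong_case))
```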
195 respondents were coded as 2 on this variable. What does that mean?
classes (character, factor, numeric)
DICHOTOMOUS (aka binary) A variable with only two categories.
NOMINAL A variable made up of categories that cannot be ordered according to rank.
ORDINAL A variable made up of ranked categories, but there is no systematic and measurable numeric difference between the categories.
INTERVAL-RATIO A variable with categories that are rank-ordered and expressed in the same units.
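One way to connect these levels of measurement to R classes is sketched below. The example values are made up, and the mapping is a rule of thumb rather than a strict requirement: nominal variables are often stored as factors or characters, ordinal variables as ordered factors, and interval-ratio variables as numeric.

```r
# Hypothetical example values
region  <- factor(c("south", "west", "south"))           # nominal
opinion <- factor(c("agree", "disagree", "agree"),
                  levels = c("disagree", "agree"),
                  ordered = TRUE)                         # ordinal
age     <- c(34, 51, 28)                                  # interval-ratio
name    <- c("Ana", "Ben", "Cy")                          # plain text

print(class(region))
print(class(opinion))
print(class(age))
print(class(name))
```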
Political polarization is high in the U.S. today and attitudes about gender and family behavior have been heavily debated.
Using the most recent survey, do more liberals than conservatives think sex before marriage is ‘not wrong at all’?
How do we find out?
Let’s familiarize ourselves with the premarsx and polviews variables.
In the console, type ?premarsx and hit enter. The Help pane will show you the question text, response options and values.
Now, do the same for polviews.
Run this code to see the frequency table for the premarsx variable. Then, add a line below to also see a table for the polviews variable.
The table command also lets you create a table with two variables.
Use haven::as_factor to see the value labels instead of the value numbers. Then, do the same for polviews.
always wrong almost always wrong
357 122
wrong only sometimes not wrong at all
258 1378
other iap
0 1126
don't know I don't have a job
50 0
dk, na, iap no answer
0 6
not imputable refused
0 0
skipped on web uncodeable
12 0
not available in this release not available in this year
0 0
see codebook
0
extremely liberal liberal
140 421
slightly liberal moderate, middle of the road
368 1148
slightly conservative conservative
381 516
extremely conservative don't know
186 99
iap I don't have a job
0 0
dk, na, iap no answer
0 20
not imputable refused
0 0
skipped on web uncodeable
30 0
not available in this release not available in this year
0 0
see codebook
0
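The difference between value numbers and value labels can be sketched with a small labelled vector. `premarsx_toy` is a hypothetical stand-in for the survey variable, built with haven::labelled(); the counts are invented.

```r
library(haven)

# Hypothetical toy vector with value labels (not the real GSS data)
premarsx_toy <- labelled(
  c(1, 4, 4, 2),
  labels = c(
    "always wrong"         = 1,
    "almost always wrong"  = 2,
    "wrong only sometimes" = 3,
    "not wrong at all"     = 4
  )
)

# Frequencies by value number
table(premarsx_toy)

# Frequencies by value label; unused labels appear with a count of 0
label_counts <- table(as_factor(premarsx_toy))
print(label_counts)
```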
Let’s clean up the levels for premarsx.
gss24$premarsx <- zap_missing(gss24$premarsx)
gss24$premarsx <- as_factor(gss24$premarsx)
table(gss24$premarsx)
always wrong almost always wrong wrong only sometimes
357 122 258
not wrong at all other
1378 0
Let’s get rid of the empty levels in premarsx.
always wrong almost always wrong wrong only sometimes
357 122 258
not wrong at all
1378
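Empty levels can be dropped with base R's droplevels(); forcats::fct_drop() is an alternative. The sketch below assumes this approach and uses an invented factor, since the original code for this step is not shown.

```r
# Hypothetical factor with an unused "other" level
f <- factor(
  c("always wrong", "not wrong at all", "not wrong at all"),
  levels = c("always wrong", "not wrong at all", "other")
)

table(f)             # "other" shows up with a count of 0

f <- droplevels(f)   # remove levels that have no observations
counts <- table(f)
print(counts)
```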
For polviews, let’s combine categories to ease interpretation. This is easiest when the levels are numeric.
Let’s remind ourselves which value corresponds to each label.
[1] extremely liberal [2] liberal
140 421
[3] slightly liberal [4] moderate, middle of the road
368 1148
[5] slightly conservative [6] conservative
381 516
[7] extremely conservative [NA] don't know
186 99
[NA] iap [NA] I don't have a job
0 0
[NA] dk, na, iap [NA] no answer
0 20
[NA] not imputable [NA] refused
0 0
[NA] skipped on web [NA] uncodeable
30 0
[NA] not available in this release [NA] not available in this year
0 0
[NA] see codebook
0
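A table in this "[value] label" style can be produced with haven::as_factor() and levels = "both". Whether the original slide used exactly this call is an assumption; the sketch below uses a shortened, hypothetical version of the variable.

```r
library(haven)

# Hypothetical, shortened stand-in for polviews
polviews_toy <- labelled(
  c(1, 2, 2, 3),
  labels = c(
    "extremely liberal" = 1,
    "liberal"           = 2,
    "slightly liberal"  = 3
  )
)

# levels = "both" shows the value number next to its label
both_counts <- table(as_factor(polviews_toy, levels = "both"))
print(both_counts)
```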
gss24 <- gss24 |>   # the pipe can be written as |> or %>%
  mutate(
    pol3cat = case_when(
      polviews >= 1 & polviews <= 3 ~ "Liberal",
      polviews == 4                 ~ "Moderate",
      polviews >= 5 & polviews <= 7 ~ "Conservative",
      TRUE                          ~ NA_character_   # anything else stays missing
    ),
    # order the levels from liberal to conservative
    pol3cat = factor(pol3cat,
                     levels = c("Liberal", "Moderate", "Conservative"))
  )
Always double check your work.
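One way to double check a recode is to cross-tabulate the original variable against the new one: every original value should land in exactly one new category, and missing values should stay missing. The sketch below applies the same recoding rules to a hypothetical toy data set.

```r
library(dplyr)

# Hypothetical toy data (not the GSS)
toy <- tibble(polviews = c(1, 3, 4, 5, 7, NA))

toy <- toy |>
  mutate(
    pol3cat = case_when(
      polviews >= 1 & polviews <= 3 ~ "Liberal",
      polviews == 4                 ~ "Moderate",
      polviews >= 5 & polviews <= 7 ~ "Conservative",
      TRUE                          ~ NA_character_
    ),
    pol3cat = factor(pol3cat,
                     levels = c("Liberal", "Moderate", "Conservative"))
  )

# Rows: original values; columns: new categories (NA shown explicitly)
check <- table(toy$polviews, toy$pol3cat, useNA = "ifany")
print(check)
```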
Make a frequency table. One of summarytools’ main purposes is to help with cleaning and preparing data for further analysis. Pay attention to the missing values. Then, do the same for premarsx.
Frequencies
gss24$pol3cat
Type: Factor
Freq % Valid % Valid Cum. % Total % Total Cum.
------------------ ------ --------- -------------- --------- --------------
Liberal 929 29.40 29.40 28.07 28.07
Moderate 1148 36.33 65.73 34.69 62.77
Conservative 1083 34.27 100.00 32.73 95.50
<NA> 149 4.50 100.00
Total 3309 100.00 100.00 100.00 100.00
Frequencies
gss24$premarsx
Type: Factor
Freq % Valid % Valid Cum. % Total % Total Cum.
-------------------------- ------ --------- -------------- --------- --------------
always wrong 357 16.88 16.88 10.79 10.79
almost always wrong 122 5.77 22.65 3.69 14.48
wrong only sometimes 258 12.20 34.85 7.80 22.27
not wrong at all 1378 65.15 100.00 41.64 63.92
<NA> 1194 36.08 100.00
Total 3309 100.00 100.00 100.00 100.00
Using report.nas = FALSE suppresses the missing data.
The headings = FALSE parameter suppresses the heading section. Do the same for premarsx.
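The two arguments can be combined in a single freq() call. The sketch below assumes the summarytools package is installed and uses a hypothetical factor, `pol3cat_toy`, in place of the survey variable.

```r
library(summarytools)

# Hypothetical factor with one missing value
pol3cat_toy <- factor(
  c("Liberal", "Moderate", "Moderate", "Conservative", NA),
  levels = c("Liberal", "Moderate", "Conservative")
)

# Default: heading section and missing-value rows are shown
full <- freq(pol3cat_toy)
print(full)

# Compact: no <NA> row or "Total" percentages, and no heading section
compact <- freq(pol3cat_toy, report.nas = FALSE, headings = FALSE)
print(compact)
```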
Based on your table, what percentage of respondents believe sex before marriage is ‘almost always wrong’?
Based on your table, what percentage of respondents believe sex before marriage is ‘always’ or ‘almost always wrong’?
The table() function gives us the frequencies.
Liberal Moderate Conservative
always wrong 32 78 229
almost always wrong 21 44 51
wrong only sometimes 58 91 100
not wrong at all 505 488 331
We want to add the column percentages…
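Column percentages divide each cell by its column total, so each column sums to 100%. As a sketch in base R, the counts from the frequency table above can be entered by hand and passed to prop.table() with margin = 2 (columns). Typing the counts in manually is only for illustration; in the lab you would tabulate the variables directly.

```r
# Counts copied from the frequency table above
tab <- matrix(
  c(32, 21, 58, 505,     # Liberal
    78, 44, 91, 488,     # Moderate
    229, 51, 100, 331),  # Conservative
  nrow = 4,
  dimnames = list(
    premarsx = c("always wrong", "almost always wrong",
                 "wrong only sometimes", "not wrong at all"),
    pol3cat  = c("Liberal", "Moderate", "Conservative")
  )
)

col_totals <- colSums(tab)

# margin = 2 means "within each column"
col_pct <- round(100 * prop.table(tab, margin = 2), 1)

print(col_totals)
print(col_pct)
```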
What’s your conclusion to our initial research question?
% who think sex relations before marriage is __________, by political views
Cross-Tabulation, Column Proportions
premarsx * pol3cat
Data Frame: gss24
---------------------- --------- -------------- -------------- -------------- ---------------
pol3cat Liberal Moderate Conservative Total
premarsx
always wrong 32 ( 5.2%) 78 ( 11.1%) 229 ( 32.2%) 339 ( 16.7%)
almost always wrong 21 ( 3.4%) 44 ( 6.3%) 51 ( 7.2%) 116 ( 5.7%)
wrong only sometimes 58 ( 9.4%) 91 ( 13.0%) 100 ( 14.1%) 249 ( 12.3%)
not wrong at all 505 ( 82.0%) 488 ( 69.6%) 331 ( 46.6%) 1324 ( 65.3%)
Total 616 (100.0%) 701 (100.0%) 711 (100.0%) 2028 (100.0%)
---------------------- --------- -------------- -------------- -------------- ---------------